distinct() eliminates duplicate records (rows where all column values match) from a DataFrame.
val Data = Seq(
  ("James", "Sales", 3000),
  ("Michael", "Sales", 4600),
  ("Robert", "Sales", 100),
  ("Maria", "Finance", 3000),
  ("James", "Sales", 3000),
  ("Scott", "Finance", 3300),
  ("Jen", "Finance", 3900),
  ("Jeff", "Marketing", 3000),
  ("Kumar", "Marketing", 2000),
  ("Saif", "Sales", 4100))
import spark.implicits._  // needed for toDF() and the $"col" syntax outside spark-shell
val df = Data.toDF("employee_name", "department", "salary")
Select the department column from the DataFrame:
df.select("department").show()
We can use the distinct() function to remove duplicate rows from a DataFrame and get back a DataFrame with no duplicates.
df.select($"department").distinct().show()
You can also use dropDuplicates() to get unique values. Like distinct(), it drops duplicate rows from a DataFrame; unlike distinct(), it can also take a subset of columns to consider when deciding which rows are duplicates.
df.select($"department").dropDuplicates().show()
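For intuition, here is a plain-Python sketch (an analogy, not Spark API) of how both operations identify duplicates: keep the first occurrence of each key, where the key is every column (distinct) or a chosen subset of columns (dropDuplicates with a column list).

```python
# Plain-Python analogy of distinct()/dropDuplicates(): keep the first
# occurrence of each key. key_indices=None means "all columns", i.e.
# distinct(); a list of indices mimics dropDuplicates(["col", ...]).
def drop_duplicates(rows, key_indices=None):
    seen = set()
    out = []
    for row in rows:
        key = row if key_indices is None else tuple(row[i] for i in key_indices)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [("James", "Sales", 3000), ("James", "Sales", 3000),
        ("Scott", "Finance", 3300), ("Jen", "Finance", 3900)]
print(drop_duplicates(rows))       # exact-duplicate row removed
print(drop_duplicates(rows, [1]))  # one row kept per department
```

Note one behavioral parallel: like Spark, this keeps a single representative row per key, so which salary survives in the subset case depends on row order.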
Count the unique records in a DataFrame:
#pyspark
Data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Michael", "Sales", 4600),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Kumar", "Marketing", 2000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)]
columns = ["empno", "job", "sal"]
df = spark.createDataFrame(data=Data, schema=columns)
df.count()
Out[26]: 15
df.distinct().count()
Out[27]: 9
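As a quick sanity check outside Spark: each record above is a tuple, so the distinct count is simply the number of unique tuples. A plain-Python analogy (not Spark API) reproduces both counts:

```python
# The same 15 records as the PySpark example above, as plain tuples.
Data = [("James", "Sales", 3000), ("Michael", "Sales", 4600),
        ("Michael", "Sales", 4600), ("Michael", "Sales", 4600),
        ("Robert", "Sales", 100), ("Maria", "Finance", 3000),
        ("James", "Sales", 3000), ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900), ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000), ("Kumar", "Marketing", 2000),
        ("Kumar", "Marketing", 2000), ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

print(len(Data))       # 15, like df.count()
print(len(set(Data)))  # 9,  like df.distinct().count()
```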